Part I - Ford GoBike System Data

by Shion Kim

Introduction

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.

Preliminary Wrangling

What is the structure of your dataset?

The dataframe records 183412 trips in 16 columns (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip).
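The structure described above can be checked with a few lines of pandas. This is a minimal sketch: the `sample` frame and the helper `describe_structure` are hypothetical stand-ins, and the values are made up. With the real trip file loaded through `pd.read_csv`, the same helper would report the row and column counts directly.

```python
import pandas as pd

# Tiny made-up sample standing in for the real trip file (values are illustrative only).
sample = pd.DataFrame({
    "duration_sec": [520, 1430, 310],
    "user_type": ["Subscriber", "Customer", "Subscriber"],
    "member_gender": ["Male", "Female", None],
})

def describe_structure(df):
    """Return the row count, column count, and number of missing values per column."""
    n_rows, n_cols = df.shape
    missing = df.isna().sum().to_dict()
    return n_rows, n_cols, missing

n_rows, n_cols, missing = describe_structure(sample)
print(n_rows, n_cols, missing)
```

Counting columns programmatically avoids miscounting a long column list by hand.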

There are 2 types of user_type, which are Customer and Subscriber.

Rider gender is stored in member_gender, which takes the values Male, Female, and Other.

What is/are the main feature(s) of interest in your dataset?

I'm interested in finding out how long the average trip takes. I also want to know when most trips are taken in terms of time of day, day of the week, or month of the year, and whether this depends on a user being a subscriber or a customer.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that duration_sec, start_time, end_time, and user_type will play a big role in this data exploration. There could also be a relationship with the three genders in this dataset. I expect that station latitude and the user's age play a big role as well.

Univariate Exploration

I'll start by looking at the distribution of the main variable of interest: duration.

Duration has a long-tailed distribution, with a lot of trips on the low duration end and few on the high duration end. When plotted on a log scale, the duration distribution looks right-skewed, with one peak between 350 and 390 seconds. Interestingly, there's a steep decline in frequency right after 2000 seconds.
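One way to bin a long-tailed variable such as duration is to histogram the log10 of the values and convert the bin edges back to seconds. The sketch below assumes NumPy is available. The `durations` array is made up, and `log_histogram` is a hypothetical helper, not the code used for the plots in this report.

```python
import numpy as np

# Made-up durations in seconds, used only to exercise the code.
durations = np.array([61, 350, 380, 520, 1400, 2100, 85000])

def log_histogram(values, n_bins=20):
    """Histogram in log10 space; returns counts and bin edges converted back to seconds."""
    counts, log_edges = np.histogram(np.log10(values), bins=n_bins)
    return counts, 10 ** log_edges

counts, edges_sec = log_histogram(durations)
print(counts.sum(), len(edges_sec))
```

Because the bins have equal width in log space, each bin covers the same ratio of durations, which keeps the short-trip peak from being squeezed into one or two bars.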

Next up, the first predictor variable of interest: start_time.

A smaller bin size generally gives us more detail to analyse.

In the case of start_time, the small bin size proves very illuminating. There are very large spikes in frequency at specific times of the day (e.g. 3PM-9PM), and frequency quickly trails off until the next spike. These probably represent the busiest hours in a day.

If we look at the data on a daily basis, there's a spike during the first and third weeks. Finally, we can conclude that there are significantly fewer users during the weekend, with Thursday being the most popular day of all.
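The hour-of-day and day-of-week views discussed above can be derived from start_time with the pandas datetime accessor. This is a sketch under assumptions: the timestamps are made up, and the derived column names `start_hour`, `start_day`, and `is_weekend` are hypothetical.

```python
import pandas as pd

# Made-up timestamps, used only to exercise the code.
trips = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10", "2019-02-23 09:05:00", "2019-02-07 08:15:45"],
})

# Parse the strings into datetimes, then derive the time features.
trips["start_time"] = pd.to_datetime(trips["start_time"])
trips["start_hour"] = trips["start_time"].dt.hour
trips["start_day"] = trips["start_time"].dt.day_name()
# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend.
trips["is_weekend"] = trips["start_time"].dt.dayofweek >= 5

print(trips[["start_hour", "start_day", "is_weekend"]])
```

Counting rows per `start_hour` or `start_day` then gives the frequencies shown in the histograms.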

I'll now move on to the other variable in the dataset: Gender.

Notably, almost 71% of users on this platform are male. Women make up 22% of the user base, and approximately 2% of users identify as Other.

On top of that, 90% of users are subscribers, while only 10% are customers.
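Shares like these can be computed with `value_counts(normalize=True)`. The sketch below uses a made-up `gender` series and a hypothetical helper `share_percent`. Since the three gender shares reported above do not sum to 100%, the helper keeps missing values as their own group, on the assumption that some rows may have no gender recorded.

```python
import pandas as pd

def share_percent(series):
    """Percentage of rows per category, with missing values counted as their own group."""
    return (series.value_counts(normalize=True, dropna=False) * 100).round(1)

# Made-up values, used only to exercise the code.
gender = pd.Series(["Male", "Male", "Female", None, "Other", "Male", "Male", "Female"])
shares = share_percent(gender)
print(shares)
```

The same helper applies unchanged to user_type.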

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The duration variable took on a large range of values, so I looked at the data using a log transform. Also, the days of the week and date columns had to be changed to strings so that I could generate the histograms.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

During the weekend, the number of users drops significantly, which suggests that something is keeping them from using the bikes.

Bivariate Exploration

To start off with, I want to look at the pairwise correlations present between features in the data.

There's a strong positive relationship between end_station_latitude and start_station_latitude, and also between end_station_longitude and start_station_longitude. On the other hand, we can see a strong negative relationship between start_station_longitude and start_station_latitude, between end_station_longitude and start_station_latitude, and between end_station_longitude and end_station_latitude.

We can see that there are no significant differences in these plots. It is possible that users are a bit younger on Tuesdays and Fridays than usual, as the lower edge of the IQR extends further down than on the other days of the week.
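Pairwise correlations of this kind can be read from `DataFrame.corr()`, which returns Pearson coefficients by default. The coordinates below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up station coordinates, used only to exercise the code.
coords = pd.DataFrame({
    "start_station_latitude": [37.78, 37.80, 37.33, 37.87],
    "start_station_longitude": [-122.40, -122.41, -121.89, -122.27],
    "end_station_latitude": [37.79, 37.78, 37.34, 37.86],
    "end_station_longitude": [-122.39, -122.42, -121.90, -122.26],
})

# Pearson correlation between every pair of numeric columns.
corr = coords.corr()
print(corr.round(2))
```

Each cell of `corr` lies between -1 and 1. Values near 1 indicate a strong positive relationship and values near -1 a strong negative one.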
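The box-plot comparison can be backed by numbers by computing the quartiles of birth year for each day. This is a sketch with made-up values. The `start_day` column is a hypothetical derived column holding the weekday name of start_time.

```python
import pandas as pd

# Made-up birth years, used only to exercise the code.
rides = pd.DataFrame({
    "start_day": ["Tuesday"] * 4 + ["Thursday"] * 4,
    "member_birth_year": [1980, 1985, 1990, 1995, 1982, 1986, 1990, 1994],
})

# One row per day, one column per quantile, plus the interquartile range.
quartiles = (
    rides.groupby("start_day")["member_birth_year"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

Comparing the 0.25 and 0.75 columns across days shows whether the box for one day really sits lower or higher than the others.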

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Birth year had a surprisingly high correlation with the duration of the ride. An approximately exponential relationship was observed when duration was plotted. Box plots tell us that there aren't huge differences across user gender or day of the week.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There was also an interesting relationship observed between start_time and end_time. In addition, we can see a strong negative relationship between start_station_longitude and start_station_latitude, between end_station_longitude and start_station_latitude, and between end_station_longitude and end_station_latitude.

Multivariate Exploration

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I extended my investigation of start time against duration in this section by looking at the impact of the categorical features. The multivariate exploration showed that there are more values for younger riders when birth year is included, but in the second plot it is hard to see any relationship.

Were there any interesting or surprising interactions between features?

Looking at the point plots, it doesn't appear that the three category features have a systematic interaction impact. The features, on the other hand, aren't completely self-contained. However, it's fascinating to see how the start time plot for duration relates to the days of the week.
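A compact way to look for an interaction between two categorical features and duration is a pivot table of a summary statistic. The sketch below uses made-up rides and the median as the statistic. `start_day` is a hypothetical derived column, and this is not the code behind the point plots.

```python
import pandas as pd

# Made-up rides, used only to exercise the code.
rides = pd.DataFrame({
    "start_day": ["Monday", "Monday", "Saturday", "Saturday", "Monday", "Saturday"],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Customer", "Subscriber", "Customer"],
    "duration_sec": [400, 900, 600, 1500, 500, 1300],
})

# Median duration for every combination of day and user type.
summary = rides.pivot_table(
    index="start_day", columns="user_type", values="duration_sec", aggfunc="median"
)
print(summary)
```

If the gap between the Customer and Subscriber columns stays roughly constant across rows, the two features do not interact much. A gap that changes from day to day would point to an interaction.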

Conclusions

In this data investigation, I expected duration_sec, start_time, end_time, and user_type to play a significant role. I also thought there could be a relationship with the three genders, and that the user's age and station latitude would have a significant effect. As I predicted, duration has a long-tailed distribution, with a large number of trips on the low end and a small number on the high end. The duration distribution appears right-skewed when shown on a log scale, with one peak between 350 and 390 seconds. Surprisingly, there is a steep drop in frequency right after 2000 seconds.

The first predictor variable I looked at was start_time. The small bin size is particularly useful in this case. At certain times of the day (e.g. 3PM-9PM), there are big spikes in frequency, which quickly fade away until the next surge. These are most likely the busiest hours of the day. When we look at the data on a daily basis, we can see a rise in the first and third weeks. Finally, weekend usage is significantly lower, with Thursday being the most popular day of the week.

Next, I looked at the gender variable in the dataset. Notably, almost 71% of users on this platform are male. Women make up 22% of the user base, and about 2% of users identify as Other. Furthermore, 90% of users are subscribers, whereas only 10% are customers.

Because the duration variable had such a wide range of values, I used a log transform to examine the data. In order to construct the histograms, the days of the week and date columns had to be converted to strings. The number of users drops dramatically over the weekend, which suggests that something is keeping them from using the bikes.

The length of the ride was interestingly correlated with the year of birth. When the duration was plotted, an approximately exponential relationship was observed. Box plots show that there aren't large differences across user gender or day of the week. There is also a strong negative relationship between start_station_longitude and start_station_latitude, between end_station_longitude and start_station_latitude, and between end_station_longitude and end_station_latitude.

Between start_time and end_time, there is a strong positive relationship, which is linear.